Abstract: One way to find near-matches in large datasets is
to use hash functions [7], [16]. In recent years locality-sensitive
hash functions for various metrics have been given; for the
Hamming metric, projecting onto $k$ bits is a simple hash function
that performs well.

In this paper we investigate alternatives to projection. For
various parameters, hash functions given by complete decoding
algorithms for error-correcting codes work better, and
asymptotically random codes perform better than projection.



I. Introduction 

Given a set of M n-bit vectors, a classical problem is to 
quickly identify ones which are close in Hamming distance. 
This problem has applications in numerous areas, such as 
information retrieval and DNA sequence comparison. The 
nearest-neighbor problem is to find a vector close to a given 
one, while the closest-pair problem is to find the pair in the set 
with the smallest Hamming distance. Approximate versions of
these problems allow an answer where the distance may be a
factor of $(1 + \epsilon)$ larger than the best possible.

One approach ([7], [12], [16]) is locality-sensitive hashing
(LSH). A family of hash functions $H$ is called $(r, cr, p_1, p_2)$-sensitive
if for any two points $x, y \in V$,

• if $d(x,y) \le r$, then $\mathrm{Prob}(h(x) = h(y)) \ge p_1$,
• if $d(x,y) \ge cr$, then $\mathrm{Prob}(h(x) = h(y)) \le p_2$.

Let $\rho = \log(1/p_1)/\log(1/p_2)$. An LSH scheme can be
used to solve the approximate nearest neighbor problem for
$M$ points in time $O(M^{\rho})$. Indyk and Motwani [14] showed
that projection has $\rho = 1/c$.
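
The exponent for projection can be illustrated numerically. The sketch below assumes the usual bit-sampling model, in which one coordinate is chosen uniformly at random, so two vectors at Hamming distance $d$ collide with probability $1 - d/n$; the function name `projection_rho` and the parameter values are illustrative, not taken from the references.

```python
import math

def projection_rho(n, r, c):
    """rho = log(1/p1) / log(1/p2) for sampling one random coordinate.

    Assumption: vectors at distance d agree on a uniformly random
    coordinate with probability 1 - d/n.
    """
    p1 = 1 - r / n          # collision probability at distance r
    p2 = 1 - c * r / n      # collision probability at distance c*r
    return math.log(1 / p1) / math.log(1 / p2)

# Projecting onto k coordinates raises p1 and p2 to the k-th power,
# which leaves the ratio of logarithms unchanged.
rho = projection_rho(n=128, r=8, c=2.0)
print(rho, "<=", 1 / 2.0)
```

For these values the computed exponent lies slightly below $1/c$, in line with the bound quoted above.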

The standard hash to use is projection onto $k$ of the $n$
coordinates [12]. An alternative family of hashes is based
on minimum-weight decoding with error-correcting codes [5],
[20]. An $[n, k]$ code $C$ with a complete decoding algorithm
defines a hash $h_C$, where each $v \in V := \mathbb{F}_2^n$ is mapped to the
codeword $c \in C \subseteq V$ to which $v$ decodes. Using linear codes
for hashing schemes has been independently suggested many
times; see [5], [10], and the patents [4] and [20].

In [5] the binary Golay code was suggested to find approximate
matches in bit-vectors. Data is provided that suggests it
is effective, but it is still not clear when the Golay or other
codes work better than projection. In this paper we attempt to
quantify this, using tools from coding theory.

Our model is somewhat different from the usual LSH
literature. We are interested in the scenario where we have
a collection of $M$ random points of $V$, one of which, $x$, has
been duplicated with errors. The error vector $e$ has each bit
nonzero with probability $p$. Let $P_C(p)$ be the probability that
$h_C(x) = h_C(x + e)$. Then the probability of collision of two
points $x$ and $y$ is

• if $y = x + e$, then $\mathrm{Prob}(h(x) = h(y)) = p_1 = P_C(p)$,
• if $y \ne x + e$, then $\mathrm{Prob}(h(x) = h(y)) = p_2 = 2^{-k}$.

Then the number of elements that hash to $h(x)$ will be about
$M/2^k$, and the probability that one of these will be $y = x + e$
is $P_C(p)$. If this fails, we may try again with a new hash, say
the same one applied after shifting the $M$ points by a fixed
vector, and continue until $y$ is found.

Let $\rho = \log(1/p_1)/\log(1/p_2)$ as for LSH. Taking $2^k \approx M$,
we expect to find $y$ in time

$$\frac{M}{2^k P_C(p)},$$

which is about $M^{\rho}$, since $P_C(p) = p_1 = p_2^{\rho} \approx M^{-\rho}$.
As with LSH, we want to optimize this by minimizing $\rho$, i.e.
finding a hash function maximizing $P_C(p)$.

For a linear code with a complete translation-invariant
decoding algorithm (so that $h(x) = c$ implies that $h(x + c') =
c + c'$), studying $P_C$ is equivalent to studying the properties of
the set $S$ of all points in $V$ that decode to $0$. In Section III and
the appendix we systematically investigate sets of size $\le 64$.

Suppose that we pick a random $x \in S$. Then the probability
that $y = x + e$ is in $S$ is

$$P_S(p) = \frac{1}{|S|} \sum_{x,y \in S} p^{d(x,y)} (1-p)^{n-d(x,y)}. \qquad (1)$$



This function has been studied extensively in the setting
of error-detecting codes [17]. In that literature, $S$ is a code,
$P_S(p)$ is the probability of an undetected error, and the goal is
to minimize this probability. Here, on the other hand, we will
call a set optimal for $p$ if no set in $V$ of size $|S|$ has greater
probability.

As the error rate p approaches 1/2, this coincides with 
the definition of distance-sum optimal sets, which were first 
studied by Ahlswede and Katona [1]. 

The error exponent of a code $C$ is

$$E_C(p) = -\frac{1}{n} \lg P_C(p).$$

In this paper $\lg$ denotes log to base 2. We are interested in
properties of the error exponent over codes of rate $R = k/n$
as $n \to \infty$. Note that $\rho = E_C(p)/R$, so minimizing the error
exponent will give us the best code to use for finding closest
pairs. In Section IV we will show that hash functions from
random (nonlinear) codes have a better error exponent than
projection.
II. Hash Functions From Codes

For a set $S \subseteq V$, let

$$A_i = \#\{(x, y) : x, y \in S \text{ and } d(x, y) = i\}$$

count the number of pairs of words in $S$ at distance $i$. The
distance distribution function is

$$A(S,\zeta) := \sum_{i=0}^{n} A_i \zeta^i. \qquad (2)$$

This function is directly connected to $P_S(p)$ [17]. If $x$ is
a random element of $S$, and $y = x + e$, where $e$ is an error
vector where each bit is nonzero with probability $p$, then the
probability that $y \in S$ is

$$P_S(p) = \frac{1}{|S|} \sum_{x,y \in S} p^{d(x,y)} (1-p)^{n-d(x,y)}
       = \frac{1}{|S|} \sum_{i=0}^{n} A_i\, p^i (1-p)^{n-i}
       = \frac{(1-p)^n}{|S|}\, A\!\left(S, \frac{p}{1-p}\right). \qquad (3)$$

In this section we will evaluate (3) for projection and for
perfect codes, and then consider other linear codes.

A. Projection 

The simplest hash is to project vectors in $V$ onto $k$ coordinates.
Let $k$-projection denote the $[n, k]$ code $\mathcal{P}_{n,k}$ corresponding
to this hash. The associated set $S$ of vectors mapped to
$0$ is a $2^{n-k}$-subcube of $V$. The distance distribution function
is

$$A(S,\zeta) = (2(1 + \zeta))^{n-k}, \qquad (4)$$

so the probability of collision is

$$P_{\mathcal{P}_{n,k}}(p) = \frac{(1-p)^n}{2^{n-k}} \left(\frac{2}{1-p}\right)^{n-k} = (1-p)^k. \qquad (5)$$

$\mathcal{P}_{n,k}$ is not a good error-correcting code, but for sufficiently
small error probability its hash function is optimal.
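
The closed form (5) can be sanity-checked by brute force for small parameters. The following Python sketch evaluates the collision probability of a set directly as a sum over ordered pairs; the function name `collision_probability` and the values of `n`, `k`, `p` are illustrative assumptions, not taken from the paper.

```python
def collision_probability(S, n, p):
    """P_S(p) = (1/|S|) * sum over ordered pairs (x, y) of p^d (1-p)^(n-d),
    where d is the Hamming distance between x and y (vectors stored as ints)."""
    total = 0.0
    for x in S:
        for y in S:
            d = bin(x ^ y).count("1")
            total += p ** d * (1 - p) ** (n - d)
    return total / len(S)

n, k, p = 8, 3, 0.1
# Vectors whose k leading coordinates are 0: the set mapped to 0 by k-projection.
subcube = list(range(2 ** (n - k)))
print(collision_probability(subcube, n, p), (1 - p) ** k)
```

Both printed values agree up to floating-point rounding, as (5) predicts.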

Theorem 1: Let $S$ be the $2^{n-k}$-subcube of $V$. For any error
probability $p \in (0, 2^{-2(n-k)})$, $S$ is an optimal set, and so
$k$-projection is an optimal hash.

Proof: The distance distribution function for $S$ is

$$A(S,\zeta) = 2^{n-k}(1 + \zeta)^{n-k}.$$

The edge isoperimetric inequality for an $n$-cube [13] states
that

Lemma 2: Any subset $S$ of the vertices of the $n$-dimensional
cube $Q_n$ has at most

$$\tfrac{1}{2}\,|S| \lg |S|$$

edges between vertices in $S$, with equality if and only if $S$ is
a subcube.

Any set $S'$ with $2^{n-k}$ points has distance distribution
function

$$A(S',\zeta) = \sum_{i=0}^{n} c_i \zeta^i,$$

where $c_0 = 2^{n-k}$, $c_1 \le (n-k)2^{n-k}$ by Lemma 2, and the
sum of the $c_i$'s is $2^{2(n-k)}$. By (3) the probability of collision
is $(1-p)^n 2^{-(n-k)} A(S', p/(1-p))$. If $S'$ is not a subcube, then
for $\zeta \in (0,1)$

$$A(S',\zeta) \le 2^{n-k} + \zeta\left((n-k)2^{n-k} - 1\right)
   + \zeta^2\left(2^{2(n-k)} - (n-k+1)2^{n-k} + 1\right),$$

and

$$A(S,\zeta) - A(S',\zeta)
   \ge \zeta - \zeta^2\left(2^{2(n-k)} - 2^{n-k-1}\left((n-k)^2 + n - k + 2\right) + 1\right)
   \ge \zeta - \zeta^2\left(2^{2(n-k)} - 1\right).$$

This is positive if $p < 1/2$ and $(1-p)/p > 2^{2(n-k)} - 1$, i.e.,
for $p < 2^{-2(n-k)}$. ■



B. Concatenated Hashes

Here we show that if $h$ and $h'$ are good hashes, then the
concatenation is as well. First we identify $C$ with $\mathbb{F}_2^k$ and treat
$h_C$ as a hash $h : \mathbb{F}_2^n \to \mathbb{F}_2^k$. We denote $P_C$ by $P^h$, and
similarly $E_C$ by $E^h$. From
$h : \mathbb{F}_2^n \to \mathbb{F}_2^k$ and $h' : \mathbb{F}_2^{n'} \to \mathbb{F}_2^{k'}$, we get a concatenated hash

$$(h, h') : \mathbb{F}_2^{n+n'} \to \mathbb{F}_2^{k+k'}.$$

Lemma 3: Fix $p \in (0, 1/2)$. Let $h$ and $h'$ be hashes. Then

$$\min\{E^{h}(p), E^{h'}(p)\} \le E^{(h,h')}(p) \le \max\{E^{h}(p), E^{h'}(p)\},$$

with strict inequalities if $E^{h}(p) \ne E^{h'}(p)$.

Proof: Since $p$ is fixed, we drop it from the notation.
Suppose $E^{h} \le E^{h'}$. Then

$$-\frac{\lg P^{h}}{n} \le -\frac{\lg P^{h} + \lg P^{h'}}{n + n'} \le -\frac{\lg P^{h'}}{n'}.$$

Since $P^{(h,h')} = P^{h} P^{h'}$, we have $E^{h} \le E^{(h,h')} \le E^{h'}$. ■

C. Perfect Codes 

An $e$-sphere around a vector $x$ is the set of all vectors $y$
with $d(x,y) \le e$. An $[n,k,2e + 1]$ code is perfect if the
$e$-spheres around codewords cover $V$. Minimum weight decoding
with perfect codes is a reasonable starting point for hashing
schemes, since all vectors are closest to a unique codeword.
The only perfect binary codes are trivial repetition codes, the
Hamming codes, and the binary Golay code. Repetition codes
do badly, but the other perfect codes give good hash functions.

1) Binary Golay Code: The $[23, 12, 7]$ binary Golay code $\mathcal{G}$
is an important perfect code. The 3-spheres around the
codewords cover $\mathbb{F}_2^{23}$. The 3-sphere around $0$ in the 23-cube
has distance distribution function

$$2048 + 11684\zeta + 128524\zeta^2 + 226688\zeta^3
   + 1133440\zeta^4 + 672980\zeta^5 + 2018940\zeta^6.$$

From this we find $P_{\mathcal{G}}(p) > P_{\mathcal{P}_{23,12}}(p)$ for $p \in (0.2555, 1/2)$.
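
The crossover value for the Golay code can be reproduced from the distance distribution function quoted above. The Python sketch below is only an illustration: the bisection routine, its bracket $[0.05, 0.45]$, and the function names are assumptions made here, not the authors' procedure.

```python
# Coefficients A_0..A_6 of the 3-sphere's distance distribution (from the text).
GOLAY_SPHERE = [2048, 11684, 128524, 226688, 1133440, 672980, 2018940]

def golay_collision(p):
    """Collision probability (1-p)^23 / 2048 * A(S, p/(1-p)) for the 3-sphere."""
    z = p / (1 - p)
    a = sum(c * z ** i for i, c in enumerate(GOLAY_SPHERE))
    return (1 - p) ** 23 * a / 2048

def projection_collision(p, k=12):
    """Collision probability (1-p)^k of k-projection."""
    return (1 - p) ** k

def golay_crossover(lo=0.05, hi=0.45, iters=60):
    """Bisect for the p at which the Golay hash overtakes 12-projection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if golay_collision(mid) < projection_collision(mid):
            lo = mid        # projection still better: crossover lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

print(golay_crossover())
```

The coefficients sum to $2048^2$, as they must for ordered pairs in a set of 2048 points, and the bisection lands close to the value 0.2555 stated in the text.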
TABLE I

Crossover error probabilities $p$ for Hamming codes $\mathcal{H}_m$

  m     k      p
  4     11   0.2826
  5     26   0.1518
  6     57   0.0838
  7    120   0.0468


2) Hamming Codes: Aside from the repetition codes and
the Golay code, the only perfect binary codes are the Hamming
codes. The $[2^m - 1, 2^m - m - 1, 3]$ Hamming code $\mathcal{H}_m$ corrects
one error.

The distance distribution function for a 1-sphere is

$$2^m + 2(2^m - 1)\zeta + (2^m - 1)(2^m - 2)\zeta^2, \qquad (6)$$

so the probability of collision $P_{\mathcal{H}_m}(p)$ is

$$\frac{(1-p)^{2^m-1}}{2^m} \left( 2^m + 2(2^m - 1)\frac{p}{1-p}
   + (2^m - 1)(2^m - 2)\frac{p^2}{(1-p)^2} \right). \qquad (7)$$

Table I gives the crossover error probabilities where the first
few Hamming codes become better than projection.
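
The entries of Table I can be recomputed from (6) and the distance distribution of the $m$-dimensional subcube. The sketch below compares the two polynomials at $\zeta = p/(1-p)$ and bisects on $p$; the bracket $[10^{-6}, 0.45]$ and the function names are assumptions chosen for this illustration.

```python
def cube_dist(m, z):
    """A(S, z) for the m-dimensional subcube: 2^m (1 + z)^m."""
    return (2 * (1 + z)) ** m

def sphere_dist(m, z):
    """Distance distribution (6) of a 1-sphere in dimension n = 2^m - 1."""
    n = 2 ** m - 1
    return (n + 1) + 2 * n * z + n * (n - 1) * z * z

def hamming_crossover(m, lo=1e-6, hi=0.45, iters=100):
    """Bisect for the error probability where H_m overtakes projection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        z = mid / (1 - mid)
        if sphere_dist(m, z) < cube_dist(m, z):
            lo = mid        # subcube still has the larger collision probability
        else:
            hi = mid
    return (lo + hi) / 2

for m in range(4, 8):
    print(m, round(hamming_crossover(m), 4))
```

Both collision probabilities share the factor $(1-p)^n/2^m$, so comparing the polynomials alone is enough to locate the crossover.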

Theorem 4: For any $m \ge 4$ and $p > m/(2^m - m)$, the
Hamming code $\mathcal{H}_m$ beats $(2^m - m - 1)$-projection.

Proof: The difference between the distribution functions
of the cube and the 1-sphere in dimension $2^m - 1$ is

$$f_m(\zeta) := A(S,\zeta) - A(\mathcal{H}_m,\zeta)
   = 2^m(1 + \zeta)^m - \left(2^m + 2(2^m - 1)\zeta + (2^m - 1)(2^m - 2)\zeta^2\right). \qquad (8)$$

We will show that, for $m \ge 4$, $f_m(\zeta)$ has exactly one root in
$(0, 1)$, denoted by $\alpha_m$, and that $\alpha_m \in ((m - 2)/2^m, m/2^m)$.
We calculate

$$f_m(\zeta) = \left((m - 2)2^m + 2\right)\zeta
   - \left(2^{2m} - \left(\binom{m}{2} + 3\right)2^m + 2\right)\zeta^2
   + 2^m \sum_{i=3}^{m} \binom{m}{i} \zeta^i.$$

All the coefficients of $f_m(\zeta)$ are non-negative with the exception
of the coefficient of $\zeta^2$, which is negative for $m \ge 2$.
Thus, by Descartes' rule of signs $f_m(\zeta)$ has 0 or 2 positive
roots. However, it has a root at $\zeta = 1$. Call the other positive
root $\alpha_m$. We have $f_m(0) = f_m(1) = 0$, and since $f_m'(0) =
(m-2)2^m + 2 > 0$ and $f_m'(1) = 2^{2m-1}(m-4) + 2^{m+2} - 2 > 0$
for $m \ge 4$, we must have $\alpha_m < 1$ for $m \ge 4$.

For $p > \alpha_m$ the Hamming code $\mathcal{H}_m$ beats projection.

Using (8) and Bernoulli's inequality, it is easy to show that
$f_m(\zeta) > 0$ for $\zeta < c(m - 2)/2^m$ for any $c < 1$ and $m \ge 4$.
For the other direction, we may use Taylor's theorem to show

$$2^m \left(1 + \frac{m}{2^m}\right)^{m} < 2^m + m^2 + \frac{m^4}{2^{m+1}} \left(1 + \frac{m}{2^m}\right)^{m-2}.$$

Plugging this into (8), we have that $f_m(m/2^m) < 0$ for $m > 6$. ■
Fig. 1. Crossover error probabilities for minimum length linear codes.



D. Other Linear Codes 

The above codes give hashing strategies for a few values of
$n$ and $k$, but we would like hashes for a wider range. For a
hashing strategy using error-correcting codes, we need a code
with an efficient complete decoding algorithm; that is, a way to
map every vector to a codeword. Given a translation-invariant
decoder, we may determine $S$, the set of vectors that decode
to $0$, in order to compare strategies as the error probability
changes.

Magma [6] has a built-in database of linear codes over $\mathbb{F}_2$
of length up to 256. Most of these do not come with efficient
complete decoding algorithms, but Magma does provide
syndrome decoding. Using this database new hashing schemes
were found. For each dimension $k$ and minimum distance $d$,
an $[n,k,d]$ binary linear code with minimum length $n$ was
chosen for testing.$^1$ (This criterion excludes any codes formed
by concatenating with a projection code.) For each code there
is an error probability above which the code beats projection.
Figure 1 shows these crossover probabilities. Not surprisingly,
the $[23, 12, 7]$ Golay code $\mathcal{G}$ and Hamming codes $\mathcal{H}_4$ and
$\mathcal{H}_5$ all do well. The facts that concatenating the Golay code
with projection beats the chosen code for $13 \le k \le 17$ and
concatenating $\mathcal{H}_5$ with projection beats the chosen codes for
$27 \le k \le 30$ show that factors other than minimum length
are important in determining an optimal hashing code.

As linear codes are subspaces of $\mathbb{F}_2^n$, lattices are subspaces
of $\mathbb{R}^n$. The 24-dimensional Leech lattice is closely related to
the Golay code, and also has particularly nice properties. It
was used in [2] to construct a good LSH for $\mathbb{R}^n$.

III. Optimal Sets 

In the previous section we looked at the performances
of sets associated with various good error-correcting codes.
However, the problem of determining optimal sets $S \subseteq \mathbb{F}_2^n$ is
of independent interest.

The general question of finding an optimal set of size $2^t$
in $V$ for an error probability $p$ is quite hard. In this section
we will find the answer for $t \le 6$, and look at what happens
when $p$ is near 1/2.

$^1$ The Magma call BLLC(GF(2), k, d) was used to choose a code.

A. Optimal Sets of Small Size

For a vector $x = (x_1, \ldots, x_n) \in V$, let $\tau_i(x)$ be $x$ with the
$i$-th coordinate complemented, and let $s_{ij}(x)$ be $x$ with the
$i$-th and $j$-th coordinates switched.

Definition 5: Two sets are isomorphic if one can be gotten
from the other by a series of $\tau_i$ and $s_{ij}$ transformations.

Lemma 6: If $S$ and $S'$ are isomorphic, then $P_S(p) =
P_{S'}(p)$ for all $p \in [0,1]$.

The corresponding non-invertible transformations are, for $i < j$:

$$\rho_i(x) := (x_1, x_2, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n),$$
$$\sigma_{ij}(x) := \begin{cases} s_{ij}(x), & x_i = 1 \text{ and } x_j = 0, \\ x, & \text{otherwise.} \end{cases} \qquad (9)$$

Definition 7: A set $S \subseteq V$ is a down-set if $\rho_i(S) \subseteq S$ for
all $i \le n$.

Definition 8: A set $S \subseteq V$ is right-shifted if $\sigma_{ij}(S) \subseteq S$
for all $i < j \le n$.

Theorem 9: If a set $S$ is optimal, then it is isomorphic to
a right-shifted down-set.

Proof: We will show that any optimal set is isomorphic
to a right-shifted set. The proof that it must be isomorphic to
a down-set as well is similar. A similar proof for distance-sum
optimal sets (see Section III-B) was given by Kündgen in [18].

Recall that

$$P_S(p) = \frac{(1-p)^n}{|S|} \sum_{x,y \in S} \zeta^{d(x,y)},$$

where $\zeta = p/(1 - p) \in (0, 1)$. If $S$ is not right-shifted, there
is some $x \in S$ with $x_i = 1$, $x_j = 0$, and $i < j$. Let $\varphi_{ij}(S)$
replace all such $x$ with $s_{ij}(x)$. We only need to show that
this will not decrease $P_S(p)$.

Consider such an $x$ and any $y \in S$. If $y_i = y_j$, then
$d(x,y) = d(s_{ij}(x),y)$, and $P_S(p)$ will not change. If $y_i = 0$
and $y_j = 1$, then $d(s_{ij}(x), y) = d(x,y) - 2$, and since
$\zeta^{l-2} > \zeta^{l}$, that term's contribution to $P_S(p)$ increases.

Suppose $y_i = 1$ and $y_j = 0$. If $s_{ij}(y) \in S$, then
$d(x, y) + d(x, s_{ij}(y)) = d(s_{ij}(x), y) + d(s_{ij}(x), s_{ij}(y))$, and $P_S(p)$ is
unchanged. Otherwise, $\varphi_{ij}(S)$ will replace $y$ by $s_{ij}(y)$, and
$d(x,y) = d(s_{ij}(x), s_{ij}(y))$ means that $P_S(p)$ will again be
unchanged. ■
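
The effect of a single shift on the pair sum can be checked numerically. The Python sketch below is an illustration under one stated assumption: the shift is read as the usual compression, which moves $x$ to $s_{ij}(x)$ only when $s_{ij}(x)$ is not already in the set, so that the size of the set is preserved. Vectors are stored as integers with coordinate $i$ held in bit $n - i$; all names are hypothetical.

```python
import random

def pair_sum(S, z):
    """Sum over ordered pairs (x, y) in S of z^d(x, y)."""
    return sum(z ** bin(x ^ y).count("1") for x in S for y in S)

def shift(S, i, j, n):
    """Move a 1 from coordinate i to coordinate j (i < j, 1-indexed from the
    left) for every x in S with x_i = 1, x_j = 0 whose image is not in S."""
    bi, bj = 1 << (n - i), 1 << (n - j)
    out = set(S)
    for x in S:
        if (x & bi) and not (x & bj):
            y = (x ^ bi) | bj
            if y not in S:
                out.remove(x)
                out.add(y)
    return out

random.seed(0)
n, z = 6, 0.3
S = set(random.sample(range(2 ** n), 12))
for i in range(1, n + 1):
    for j in range(i + 1, n + 1):
        T = shift(S, i, j, n)
        print(i, j, pair_sum(T, z) >= pair_sum(S, z) - 1e-9)
```

Since $P_S(p)$ is the pair sum times the fixed factor $(1-p)^n/|S|$, a non-decreasing pair sum is exactly the statement used in the proof.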

Let $R_{s,n}$ denote an optimal set of size $s$ in $\mathbb{F}_2^n$. By
computing all right-shifted down-sets of size $2^t$, for $t \le 6$,
we have the following result:

Theorem 10: The optimal sets $R_{2^t,n}$ for $t \in \{1, \ldots, 6\}$
correspond to Tables IV and V.

These tables, and details of the computations, are given in the
Appendix. Some of the optimal sets for $t = 6$ do better than
the sets corresponding to the codes in Figure 1.



B. Optimal Sets for Large Error Probabilities 

Theorem 1 states that for any $n$ and $k$, for a sufficiently
small error probability $p$, a $2^{n-k}$-subcube is an optimal set.
One may also ask what an optimal set is at the other extreme,
a large error probability. In this section we use existing results
about minimum average distance subsets to list additional sets
that are optimal as $p \to 1/2^-$.

$$P_S(p) = \frac{(1-p)^n}{|S|}\, A\!\left(S, \frac{p}{1-p}\right).$$

Letting $p = 1/2 - \epsilon$ and $s = |S|$, $P_S(p)$ becomes

$$s^{-1} \sum_i A_i (1/2 - \epsilon)^i (1/2 + \epsilon)^{n-i}
   = \frac{1}{s\,2^n} \Big( \sum_i A_i + \epsilon \sum_i (2n - 4i) A_i \Big) + O(\epsilon^2)
   = \frac{s}{2^n}(1 + 2n\epsilon) - \frac{4\epsilon}{s\,2^n} \sum_i i A_i + O(\epsilon^2).$$

Therefore, an optimal set for $p \to 1/2^-$ must minimize the
distance-sum of $S$

$$d(S) := \frac{1}{2} \sum_{x,y \in S} d(x,y) = \frac{1}{2} \sum_i i A_i. \qquad (10)$$

Denote the minimal distance sum by

$$f(s,n) := \min\{d(S) : S \subseteq \mathbb{F}_2^n, |S| = s\}.$$

If $d(S) = f(s,n)$ for a set $S$ of size $s$, we say that $S$ is
distance-sum optimal. The question of which sets are distance-sum
optimal was proposed by Ahlswede and Katona in 1977;
see Kündgen [18] for references and recent results.

This question is also difficult. Kündgen presents distance-sum
optimal sets for small $s$ and $n$, which include the ones of
size 16 from Table IV. Jaeger et al. [15] found the distance-sum
optimal set for $n$ large.

Theorem 11: (Jaeger, et al. [15], cf. [18, pg. 151]) For
$n \ge s - 1$, a generalized 1-sphere (with $s$ points) is distance-sum
optimal unless $s \in \{4, 8\}$ (in which case the subcube is
optimal).

From this we have:

Corollary 12: For $n \ge 2^t - 1$, with $t \ge 4$ and $p$ sufficiently
close to 1/2, a $(2^t - 1)$-dimensional 1-sphere is hashing
optimal.
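
The exceptional sizes in Theorem 11 can be seen from the two distance sums in closed form. The short derivation below assumes that a generalized 1-sphere with $s$ points consists of a center together with $s - 1$ of its neighbors; this reading is an assumption made here for illustration and is not quoted from [15].

```latex
% Subcube with s = 2^t points: each point has \binom{t}{i} points at distance i.
d(\text{subcube}) = \tfrac{1}{2} \cdot 2^t \sum_{i=0}^{t} i \binom{t}{i} = t \, 2^{2t-2}.

% 1-sphere with s points: s-1 pairs at distance 1, \binom{s-1}{2} pairs at distance 2.
d(\text{1-sphere}) = (s-1) + 2\binom{s-1}{2} = (s-1)^2.

% s = 4:  8 < 9   and   s = 8:  48 < 49    (the subcube has the smaller sum);
% s = 16: 256 > 225                        (the 1-sphere has the smaller sum).
```

For $t \ge 4$ the gap only widens, since $t\,2^{2t-2}$ grows faster than $(2^t - 1)^2$.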

IV. Hashes from Random Codes 

In this section we will show that hashes from random
codes under minimum weight decoding$^2$ perform better than
projection. Let $R = k/n$ be the rate of a code. The error
exponent for $k$-projection, $E_{\mathcal{P}_{n,k}}(p)$, is

$$-\frac{1}{n} \lg P_{\mathcal{P}_{n,k}}(p) = -\frac{1}{n} \lg (1-p)^k = -R \lg(1-p). \qquad (11)$$

Theorem 4 shows that for any $p > 0$ there are codes with
rate $R \approx 1$ which beat projection. For any fixed $R$, we will
bound the expected error exponent for a random code $\mathcal{R}$ of
rate $R$, and show that it beats (11).

Let $H$ be the binary entropy

$$H(\delta) := -\delta \lg \delta - (1 - \delta) \lg(1 - \delta). \qquad (12)$$

Fix $\delta \in [0, 1/2)$. Let $d := \lfloor \delta n \rfloor$, let $S_d(x)$ denote the sphere
of radius $d$ around $x$, and let $V(d) := |S_d(x)|$.

2 Ties arising in minimum weight decoding are broken in some unspecified 
It is elementary to show (see [11], Exercise 5.9):

Lemma 13: Let $\mathcal{R}$ be a random code of length $n$ and rate
$R$, where $n$ is sufficiently large. For $c \in \mathcal{R}$, the probability
that a given vector $x \in S_d(c)$ is closer to another codeword
than $c$ is at most

$$2^{n(H(\delta) - 1 + R)}.$$

Lemma 13 implies that if $H(\delta) < 1 - R$ (the Gilbert-Varshamov
bound), then with high probability, any given
$x \in S_d(c)$ will be decoded to $c$. For the rest of this section
we will assume this bound, so that Lemma 13 applies.

Let $P_{\mathcal{R}}(p)$ be the probability that a random point $x$ and
$x + e$ both hash to $c$. This is greater than the probability that
$x + e$ has weight exactly $d$.

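Theorem 4 of [3] gives a bound for this:

Theorem 14: For any $\epsilon \le 1/2$ and $\delta$ such that $H(\delta) <
1 - R$ and $\epsilon \le 2\delta$,

$$-E_{\mathcal{R}}(p) \ge \epsilon \lg p + (1 - \epsilon) \lg(1 - p)
   + \delta H\!\left(\frac{\epsilon}{2\delta}\right) + (1 - \delta) H\!\left(\frac{\epsilon}{2(1 - \delta)}\right).$$

The right hand side is maximized at $\epsilon_{\max}$ satisfying

$$\frac{(2\delta - \epsilon_{\max})(2(1 - \delta) - \epsilon_{\max})}{\epsilon_{\max}^2} = \frac{(1 - p)^2}{p^2}.$$

Define

$$D(p, \delta, \epsilon) := \epsilon \lg p + (1 - \epsilon) \lg(1 - p)
   + \delta H\!\left(\frac{\epsilon}{2\delta}\right) + (1 - \delta) H\!\left(\frac{\epsilon}{2(1 - \delta)}\right)
   - (1 - H(\delta)) \lg(1 - p).$$

Then $E_{\mathcal{P}_{n,k}}(p) - E_{\mathcal{R}}(p) \ge D(p, \delta, \epsilon)$.

Theorem 15: $D(p, \delta, \epsilon_{\max}) > 0$ for any $\delta, p \in (0, 1/2)$.

Proof: Fix $\delta \in (0, 1/2)$, and let $f(p) := D(p, \delta, \epsilon_{\max})$.
It is easy to check that:

$$\lim_{p \to 0^+} f(p) = 0, \qquad \lim_{p \to 1/2^-} f(p) = 0,$$
$$\lim_{p \to 0^+} f'(p) > 0, \qquad \lim_{p \to 1/2^-} f'(p) < 0.$$

Therefore, it suffices to show that $f'(p)$ has only one zero in
$(0, 1/2)$. Observe that $\epsilon_{\max}$ is chosen so that
$\frac{\partial D}{\partial \epsilon}(p, \delta, \epsilon_{\max}) = 0$. Hence

$$f'(p) = \frac{\partial D}{\partial p}(p, \delta, \epsilon_{\max})
   = \frac{\epsilon_{\max}}{p \log(2)} - \frac{1 - \epsilon_{\max}}{(1 - p)\log(2)} + \frac{1 - H(\delta)}{(1 - p)\log(2)},$$

so

$$\log(2)\, f'(p) = \frac{\epsilon_{\max}}{p} - \frac{1 - \epsilon_{\max}}{1 - p} + \frac{1 - H(\delta)}{1 - p}.$$

Therefore $f'(p) = 0$ when $\epsilon_{\max} = pH(\delta)$. From the equation
satisfied by $\epsilon_{\max}$ we find

$$p = \frac{4\delta(1 - \delta) - H(\delta)^2}{2\left(H(\delta) - H(\delta)^2\right)}.$$ ■

Thus we have $E_{\mathcal{P}_{n,k}}(p) > E_{\mathcal{R}}(p)$, and so:

Corollary 16: For any $p \in (0,1/2)$, $R \in (0,1)$ and $n$
sufficiently large, the expected probability of collision for a
random code of rate $R$ is higher than projection.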
Appendix

By Theorem 9 we may find all optimal sets by examining
all right-shifted down-sets. Right-shifted down-sets correspond
to ideals in the poset whose elements are in $\mathbb{F}_2^n$, and with partial
order $x \preceq y$ if $x$ can be obtained from $y$ by a series of
$\rho_i$ and $\sigma_{ij}$ operations (9). It turns out that there are not too
many such ideals, and they may be computed efficiently.

Our method for producing the ideals is not new, but since
the main references are unpublished, we describe it briefly
here. In Section 4.12.2 of [19], Ruskey describes a procedure
GenIdeal for listing the ideals in a poset $\mathcal{P}$. Let $\downarrow x$ denote all
the elements $\preceq x$, and $\uparrow x$ denote all the elements $\succeq x$.

procedure GenIdeal(Q: Poset, I: Ideal)
local x: PosetElement
begin
    if Q = ∅ then PrintIt(I);
    else
        x := some element in Q;
        GenIdeal(Q − ↓x, I ∪ ↓x);
        GenIdeal(Q − ↑x, I);
end

The idea is to start with $I$ empty, and $Q = \mathcal{P}$. Then for
each $x$, an ideal either contains $x$, in which case it will be
found by the first call to GenIdeal, or it does not, in which
case the second call will find it.
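
The recursion can be tried on a small instance. The Python sketch below is a direct, unoptimized transcription and not Ruskey's implementation: it assumes vectors in $\mathbb{F}_2^n$ are stored as integers with the rightmost coordinate in bit 0, so that shifting a 1 to the right moves it to a lower-order bit, and it builds the down-sets by a plain search. The names `below` and `gen_ideals` are hypothetical.

```python
from collections import Counter

def below(y, n):
    """All vectors reachable from y by clearing a 1 or moving a 1 to an
    empty lower-order position (y itself included)."""
    seen, stack = {y}, [y]
    while stack:
        v = stack.pop()
        for b in range(n):
            if not (v >> b) & 1:
                continue
            cleared = v & ~(1 << b)
            moves = [cleared] + [cleared | (1 << c)
                                 for c in range(b) if not (v >> c) & 1]
            for w in moves:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return frozenset(seen)

def gen_ideals(n):
    """Yield every ideal of the poset exactly once, following GenIdeal."""
    elems = range(2 ** n)
    down = {x: below(x, n) for x in elems}
    up = {x: frozenset(y for y in elems if x in down[y]) for x in elems}

    def rec(q, ideal):
        if not q:
            yield ideal
            return
        x = next(iter(q))                              # some element in Q
        yield from rec(q - down[x], ideal | down[x])   # ideals containing x
        yield from rec(q - up[x], ideal)               # ideals avoiding x

    yield from rec(frozenset(elems), frozenset())

counts = Counter(len(ideal) for ideal in gen_ideals(6))
print([counts[s] for s in range(2, 8)])
```

An ideal of size $s$ only involves the rightmost $s - 1$ coordinates, so for $n = 6$ the counts for sizes 2 through 7 can be compared with the first entries of Table II.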

Finding $\uparrow x$ and $\downarrow x$ may be done efficiently if we precompute
two $|\mathcal{P}| \times |\mathcal{P}|$ incidence matrices representing these sets for
each element of $\mathcal{P}$. This precomputation takes time $O(|\mathcal{P}|^2)$,
and then the time per ideal is $O(|\mathcal{P}|)$. This is independent
of the choice of $x$. Squire (see [19] for details) realized that,
by picking $x$ to be the middle element of $Q$ in some linear
extension, the time per ideal can be shown to be $O(\lg |\mathcal{P}|)$.

We are only interested in down-sets that are right-shifted
and also are of fairly small size. The feasibility of our
computations involves both issues. In particular, within GenIdeal
we may restrict to $x \in Q$ with Size($\downarrow x$) no more than the
target size of the set we are looking for. If we were using
TABLE II

Number of right-shifted down-sets

  size    number
     2         1
     3         1
     4         2
     5         2
     6         3
     7         4
     8         6
     9         7
    10        10
    11        13
    12        18
    13        23
    14        31
    15        40
    16        54
    17        69
    18        91
    19       118
    20       155
    21       199
    22       260
    23       334
    24       433
    32      3140
    48    130979
    64   4384627


TABLE III

Optimal right-shifted down-sets $R_{64,n}$ beating known codes.
(There are no such down-sets $R_{2^t,n}$ for $t \le 5$.)

   k    n   p_cross   R_{64,n}
   6   12   0.487     $\langle 2^{11},\ 2^{10} + 2^5,\ 3 \cdot 2^8 \rangle$
   7   13   0.470     $\langle 2^{12},\ 2^{10} + 2^4,\ 3 \cdot 2^8 \rangle$
   8   14   0.439     $\langle 2^{13} + 2^2,\ 2^{13} + 3,\ 2^3 + 2^2 + 1 \rangle$
   9   15   0.391     $\langle 2^{14} + 3,\ 2^{10} + 2^2 \rangle$
  16   22   0.244     $\langle 2^{21} + 2 \rangle$
  17   23   0.242     $\langle 2^{22} + 1,\ 2^{19} + 2 \rangle$
  18   24   0.238     $\langle 2^{23} + 1,\ 2^{17} + 2 \rangle$
  19   25   0.231     $\langle 2^{24} + 1,\ 2^{15} + 2 \rangle$
  20   26   0.222     $\langle 2^{25} + 1,\ 2^{13} + 2 \rangle$
  21   27   0.212     $\langle 2^{26} + 1,\ 2^{11} + 2 \rangle$



GenIdeal with the poset whose ideals correspond to down-sets
of size 64 in $\mathbb{F}_2^{63}$, there would be 83,278,001 such $x$ to
consider. However, for our situation with right-shifted down-sets,
there are only 257 such $x$ and the problem becomes
quite manageable. Furthermore, instead of stopping when $Q$
is empty, we stop when $I$ is at or above the desired size.

Table II gives the number of right-shifted down-sets of
different sizes. The computation for size 32 sets took just over
a second on one processor of an HP Superdome. Size 64 sets
took 23 minutes. Let $R_{s,n}$ refer to an optimal set of size $s$ in
$\mathbb{F}_2^n$. Tables IV and V list $R_{2^t,n}$ for all $t \le 6$ and all $n \le 2^t$.

Several features of Tables IV and V require explanation.
First we identify the binary expansion $x = \sum_{i<n} 2^i x_{n-i}$
with the vector $x = (x_1, \ldots, x_n)$. Second, for each optimal
right-shifted down-set $R_{2^t,n}$ we have listed a minimal set
of generators. For example $\langle 2^4 - 1 \rangle$ corresponds to the
4-dimensional cube while $\langle 2^{14} \rangle$, as a subset of $\mathbb{F}_2^{15}$, corresponds
to the 15-dimensional 1-sphere.

For each set $p_{cross}$ indicates the crossover value for $p$ at
which point that set performs better than any preceding entry
in the table. For example, the 4-dimensional cube $\langle 2^4 - 1 \rangle$
is optimal for all $p \in (0, 0.5)$ if $4 \le n \le 11$ but is only
optimal for $p \in (0, 0.4560)$ if $n = 12$. For $(t, n) = (4, 13)$,
the 4-dimensional cube is optimal for $p \in (0, 0.3929)$ while
the right-shifted down-set $\langle 2^{12}, 2^2 + 1 \rangle$ is optimal for $p \in
(0.3929, 0.5)$.

There are several specific $(t, n)$ for which more than two
nonisomorphic right-shifted down-sets are optimal. In several
cases the nonisomorphic optimal right-shifted down-sets have
the same distance distribution. (The two nonisomorphic sets
$R_{2^4,12}$ were originally found by Kündgen [18, pg. 160:
Table 1].) In other cases different sets are optimal for
different values of $p$. (Such cases are highlighted with a box.)
For example, with $(t,n) = (5,19)$, the 5-dimensional
cube $\langle 2^5 - 1 \rangle$ is optimal for $p \in (0, 0.2826)$, $\langle 2^{15} + 1 \rangle$ is
optimal on $(0.2826, 0.3333)$, while $\langle 2^{18}, 2^{12} + 1 \rangle$ is optimal
on $(0.3333, 0.5)$. Somewhat similar situations involve $t = 6$
and $n \in \{19, 28, 29, 35, 36, 37, 38, 58, 59\}$. For $t \le 6$ and for
any $n$, there are at most three different optimal sets.

Some of the optimal sets $R_{64,n}$ are better than those for any
known hash function. Table III gives the best known sets for
each $k$, and their generators.

Tilings of binary spaces have also been studied [8]. Indeed
a complete translation-invariant decoding algorithm leads to
a tiling of the $n$-cube. Recently the second author and
Coppersmith [9] have shown that none of these optimal sets are
associated to tilings.
TABLE IV

Optimal right-shifted down-sets $R_{2^t,n}$ ($t \le 5$).

TABLE V

Optimal right-shifted down-sets $R_{64,n}$ ($t = 6$)

  p_cross   Distance distribution function
  0         64 + 384x + 960x^2 + 1280x^3 + 960x^4 + 384x^5 + 64x^6
  0.487     64 + 228x + 1092x^2 + 1020x^3 + 1692x^4
  0.470     64 + 226x + 1086x^2 + 1100x^3 + 1620x^4
  0.439     64 + 250x + 1002x^2 + 1508x^3 + 1032x^4 + 240x^5
  0.391     64 + 248x + 1024x^2 + 1592x^3 + 992x^4 + 176x^5
  0.333     4(1 + x)^2 (16 + 32x + 184x^2 + 24x^3)
  0.283     4(1 + x)^2 (16 + 30x + 210x^2)
  0.36      64 + 232x + 1184x^2 + 1784x^3 + 832x^4
  0.277     64 + 224x + 1240x^2 + 1752x^3 + 816x^4
  0.263     64 + 216x + 1320x^2 + 1704x^3 + 792x^4
  0.244     64 + 208x + 1424x^2 + 1640x^3 + 760x^4
  0.242     64 + 206x + 1426x^2 + 1680x^3 + 720x^4
  0.238     64 + 204x + 1440x^2 + 1716x^3 + 672x^4
  0.231     64 + 202x + 1466x^2 + 1748x^3 + 616x^4
  0.222     64 + 200x + 1504x^2 + 1776x^3 + 552x^4
  0.212     64 + 198x + 1554x^2 + 1800x^3 + 480x^4
  0.199     2(1 + x)(32 + 70x + 722x^2 + 200x^3)
  0.25      64 + 196x + 1616x^2 + 1820x^3 + 400x^4
  0.186     2(1 + x)(32 + 68x + 768x^2 + 156x^3)
  0.333     64 + 194x + 1690x^2 + 1836x^3 + 312x^4
  0.174     2(1 + x)(32 + 66x + 818x^2 + 108x^3)
  0.163     2(1 + x)(32 + 64x + 872x^2 + 56x^3)
  0.152     2(1 + x)(32 + 62x + 930x^2)
  0.1538    64 + 182x + 2002x^2 + 1848x^3
  0.1537    64 + 180x + 2016x^2 + 1836x^3
  0.153     64 + 178x + 2034x^2 + 1820x^3
  0.152     64 + 176x + 2056x^2 + 1800x^3
  0.151     64 + 174x + 2082x^2 + 1776x^3
  0.150     64 + 172x + 2112x^2 + 1748x^3
  0.148     64 + 170x + 2146x^2 + 1716x^3
  0.146     64 + 168x + 2184x^2 + 1680x^3
  0.144     64 + 166x + 2226x^2 + 1640x^3
  0.141     64 + 164x + 2272x^2 + 1596x^3
  0.139     64 + 162x + 2322x^2 + 1548x^3
  0.136     64 + 160x + 2376x^2 + 1496x^3
  0.133     64 + 158x + 2434x^2 + 1440x^3
  0.130     64 + 156x + 2496x^2 + 1380x^3
  0.127     64 + 154x + 2562x^2 + 1316x^3
  0.123     64 + 152x + 2632x^2 + 1248x^3
  0.120     64 + 150x + 2706x^2 + 1176x^3
  0.117     64 + 148x + 2784x^2 + 1100x^3
  0.114     64 + 146x + 2866x^2 + 1020x^3
  0.110     64 + 144x + 2952x^2 + 936x^3
  0.107     64 + 142x + 3042x^2 + 848x^3
  0.104     64 + 140x + 3136x^2 + 756x^3
  0.101     64 + 138x + 3234x^2 + 660x^3
  0.0978    64 + 138x + 3330x^2 + 452x^3 + 112x^4
  0.1047    64 + 136x + 3336x^2 + 560x^3
  0.0946    64 + 136x + 3440x^2 + 344x^3 + 112x^4
  0.1179    64 + 134x + 3442x^2 + 456x^3
  0.0920    64 + 132x + 3552x^2 + 348x^3
  0.0891    64 + 130x + 3666x^2 + 236x^3
  0.0864    64 + 128x + 3784x^2 + 120x^3
  0.0838    64 + 126x + 3906x^2



1) 
2 10 
o 10 



+ 2° 
+ 2 4 
,13 



1 + 2- 
- 1) 



+ 2^ 
+ 3, 
+ 3, 
+ 3 > 

+ 2, 2 10 + 3) 
+ 2, 2 7 + 3) 
+ 2, 2 4 + 3) 
+ 2 > 

+ 1, 2 19 + 2) 
+ 1, 2 17 + 2) 
+ 1, 2 15 + 2) 
+ 1, 2 13 + 2) 
+ 1, 2 11 +2) 



3-2") 
3 ■ 2 8 ) 
+ 3, 2 3 + 5) 



,9 



+ 1 
+ 1 
+ 1 
+ 1 
+ 1 
+ 1 
+ 1 
+ 1) 

, 2 28 + 1) 



+ 3 > 
+ 2 > 
+ 3 > 
2 2 + 1) 
+ 2 > 
+ 3 > 



.7) 



,2^ 



3 2 24 
5 2 23 
j' 2 22 
1 ' 2 21 
2 ' 2 20 
3' 2 19 
1' 2 18 
5 ,17 



3 2 12 
1 ' 2 11 
2' 2 10 
3 2 9 
t[ 2 S 

\2 7 
3 2 6 

r[ 2 3 

7 2 5 
3 , 7 > 



+ 1) 
+ 1) 
+ 1) 
+ 1> 
+ 1> 
+ 1> 
+ 1> 
+ 1) 
+ 1) 
+ 1> 
+ 1) 
+ 1) 
+ 1> 
+ 1> 
+ 1> 
+ 1) 
+ 1> 
+ 1> 
+ 1) 
+ 1) 
+ 1) 
+ 1) 
+ 1, 3 
+ 1) 



2" + 1) 
2 3 + 1) 
3 ■ 2) 
2 2 + 1) 
2 + 1> 



